fixes #96: NewConnPipe can wait indefinitely on peer #102
Conversation
@grrtrr As this was your concern, I would appreciate it if you could both review and test this. Sorry it took so long for me to find cycles to get to this -- it only took a few hours to put together, but life has been kind of hectic lately.
Force-pushed from d0b9ff0 to 950a583.
@gdamore This is great news, and I am happy to help look through this and with testing. It could take a little time due to other commitments, but I will be starting on this within the next few days.
Codecov Report

```
@@           Coverage Diff            @@
##             master     #102   +/-  ##
=========================================
  Coverage          ?   56.99%
=========================================
  Files             ?       41
  Lines             ?     5460
  Branches          ?        0
=========================================
  Hits              ?     3112
  Misses            ?     2085
  Partials          ?      263
=========================================
```

Continue to review the full report at Codecov.
The AppVeyor builds are all totally busted -- I think we have timing-sensitive things in the test suite, and AppVeyor seems to run too slowly. My guess is that there are reconnection steps that take too long, or that we have delays expecting connections to be up when they aren't, which ultimately leads to these failures. On my Windows 10 laptop, the test suites pass flawlessly. We may need to examine which of the tests are taking too long on AppVeyor. I also want to look at moving the tests into individual protocol directories instead of the common tests tree, which may facilitate isolating them a bit more. That's not for right now, though.
This solves the head-of-line blocking problem in Listen(). A few observations:

- If Dial() hangs, it would still block.
- Goroutine leak in worker() -- this brings up the question of conn.SetDeadline again (it would time out the read/write operations on the bad connection).
- I did not look much at IPC, since the problem was seen on tls+tcp; tcp could be similar.
```go
func (h *connHandshaker) worker(conn connHandshakerPipe) {
	item := &connHandshakerItem{c: conn}
	item.e = conn.handshake()
```
If conn.handshake() never returns due to a bad connection (as in #96), the worker() goroutine would leak.
The handshake should fail eventually. Arguably this might be a case where timeouts would be useful ... to detect a stuck TCP session. That said, normally this shouldn't be an issue: typically the kernel will eventually detect that the TCP peer is dead.

Having said that, I'd be willing to set some kind of reasonable timeout on the handshake in particular. That's probably a different case. (The timeout should probably be a small number of seconds -- under normal circumstances the handshake should complete within a few round-trip times.)
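For concreteness, a minimal sketch of what a deadline-bounded handshake could look like, assuming access to the underlying net.Conn; handshakeTimeout, handshakeWithTimeout, and doHandshake are illustrative names, not part of the actual mangos API:

```go
package main

import (
	"net"
	"time"
)

// Illustrative constant: the "small number of seconds" discussed above.
const handshakeTimeout = 5 * time.Second

// handshakeWithTimeout bounds an arbitrary handshake function with a
// connection-level deadline, so a stuck peer cannot pin the goroutine forever.
func handshakeWithTimeout(conn net.Conn, doHandshake func() error) error {
	// Any Read or Write issued by the handshake after this deadline
	// passes fails with a timeout error.
	if err := conn.SetDeadline(time.Now().Add(handshakeTimeout)); err != nil {
		return err
	}
	err := doHandshake()
	// Clear the deadline so post-handshake traffic is unaffected.
	if cerr := conn.SetDeadline(time.Time{}); cerr != nil && err == nil {
		err = cerr
	}
	return err
}
```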
I'd like to handle setting the timeout for this (to close the potential leak of the file and associated goroutine) separately. That "leak" (I'd argue it's not really a leak, as the peer could actually wake up and complete the operation some long time -- days, even! -- later) is a far less bad situation than the one we have now, where the bad peer would actually gum up everything immediately, which this PR fixes.
@gdamore - I want to thank you for your willingness and perseverance in addressing this issue. I am concerned about the complexity, since all of this will interact with the other moving parts of mangos. Using timeouts for the r/w operations would be simpler, would likewise address the 'blocked' problem this PR targets, and would also help to recover from stuck connections. I don't know how long it takes the kernel to detect a dead peer, but I think it has to do with the keep-alive timeout, which could be hours. When it comes to testing, I am sorry I misunderstood initially. Our system is still running on mangos-v1, and I need to get buy-in here to upgrade. That means some back and forth; I might be able to run independent tests, but they would not be as meaningful.
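As an illustration of the per-operation read/write timeouts being argued for here, a minimal sketch; readWithTimeout and rwTimeout are hypothetical names, not existing mangos API:

```go
package main

import (
	"net"
	"time"
)

// readWithTimeout requires each read to complete within rwTimeout, or
// fail with a net.Error whose Timeout() reports true, letting the
// caller close the stuck connection and recover.
func readWithTimeout(c net.Conn, buf []byte, rwTimeout time.Duration) (int, error) {
	if err := c.SetReadDeadline(time.Now().Add(rwTimeout)); err != nil {
		return 0, err
	}
	return c.Read(buf)
}
```

A matching write path would use SetWriteDeadline in the same way.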
@grrtrr I'm fairly confident that the moving parts are isolated from this -- the timeouts may help, and I'm open to adding them in a follow-up PR. I particularly believe that the handshake portion of the setup should have a reasonably aggressive timeout, because there is no legitimate reason for a peer to stall there -- probably I'd set that up in the range of a single-digit number of seconds. (I'm thinking like "5".) Still, setting up those timeouts was not itself sufficient, to my thinking. In particular, it seems really onerous that a slow peer can interfere with (or prevent) a well-behaved one from connecting. I'm likely to move ahead with this PR for mangos v2. I may follow up (probably will, if nobody else does) with a PR to add a timeout during the handshake. (Connection timeouts during the rest of the protocol may be tunable too, but I consider that orthogonal, as there are perfectly sane situations where a connection can be long-lived with no traffic, and other situations where that would be inappropriate.) For mangos v1, we'd probably settle for something much simpler -- like a default timeout during the handshake phase only. That might help you over the hump. Again, I'd set that to something like 5 seconds.
@gdamore Ok, I have to step back to this conversation to explain why I am uncomfortable with the "correct" fix. I do not understand your aversion to read/write timeouts. Above, you say you would be ok with adding a timeout to the handshake portion. Good -- if we are using a timeout for that, why can we not use r/w timeouts in the first place? To avoid using them, additional code is added, which has to work with 4 different socket types and 22 different protocols (the number of directories I counted underneath …). Going back to that earlier conversation, you mentioned two design decisions that, in my understanding, you considered essential:
Both are creating issues for us. First, the REQ/REP socket type uses short-lived connections (1) for doing reliable RPC. This is our main use case. I have not used the other socket types much, but imagine … In addition, there are two advantages of using r/w timeouts. First, the …
@grrtrr So I need to respond to a few things. First, I want to elaborate that I don't believe this PR should preclude the addition of timeouts, for users that want to use them. But the key issue for me -- that a single peer can interfere with forward progress by another peer -- is not adequately resolved by timeouts. I also think relying upon timeouts as a silver bullet is misguided, and far less safe than you believe -- this is based on my deep experience, but I will elaborate specifically here.
So, having said all that, I don't believe that timeouts alone are a solution to the problem. They might improve a bad situation, but they are nothing more than a band-aid over a design defect. I prefer to address the defect at its root, which is the failure to run this handshake truly concurrently.

Now, the other side of the question is whether timeouts have uses in addition to this PR. I believe they are useful in some cases. Certainly, the handshaking phase should never stall for long periods of time, and it makes sense to reap connections where the peer is not making progress in a timely fashion. So I do support (as I've said above) the idea of adding a timeout to the handshake phase -- probably 5 seconds is about right. This would also address the concern about a synchronous Dialer getting stuck by a hung server. (For various reasons, I still believe dialing should be done synchronously -- the impact a bad server has on a dialer is not going to affect anything other than the goroutine doing the Dial, which should only impact a single peer on the socket.)

As for timeouts post-handshake, the answer there is not at all cut and dried. There are use cases where very long periods of inactivity are perfectly reasonable. We use keep-alives today to prevent connections from getting stuck forever, but generally this shouldn't be a concern for most use cases. A peer pipe stuck in normal (post-handshake) communications will generally not impact more than a single message (or possibly two) worth of transactions, because until the message is properly delivered, flow-control conditions will prevail and the pipe will not be returned to the ready queue for the socket. This is a data-loss situation, but the upper-layer protocols should deal with it (either because they are lossy by design, such as PUB/SUB, or because they have their own recovery mechanisms, such as REQ/REP). So really the only consequence of a stuck pipe in this situation is the loss of access to the resources (file descriptor, memory) that it consumes.

I do recognize that there are deployment situations where even the resource leak is a concern, and where it might also be desirable to prune sessions more proactively than normal keep-alives would allow. A classic situation might be where you have a great number of mobile clients that might roam off network, be powered down, etc. If you know that this is a concern for your application, then a non-handshake timeout still makes sense, but I would consider that (and the desire for a handshake-phase timeout as well) as orthogonal issues to be addressed separately from the core design defect that this PR addresses.
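For reference, the keep-alives mentioned above map onto standard library calls; a minimal sketch, with the probe period as an illustrative value rather than an actual mangos default:

```go
package main

import (
	"net"
	"time"
)

// enableKeepAlive turns on TCP keep-alive probes so the kernel
// eventually notices a dead peer, as described above.
func enableKeepAlive(c net.Conn) error {
	tc, ok := c.(*net.TCPConn)
	if !ok {
		return nil // not TCP (e.g. an IPC pipe); nothing to do
	}
	if err := tc.SetKeepAlive(true); err != nil {
		return err
	}
	// Probe an idle connection every 30 seconds (illustrative period).
	return tc.SetKeepAlivePeriod(30 * time.Second)
}
```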
I will be merging this PR. I've heard your concerns and posted my response. We can follow up with PRs to add timeouts afterward, if you like.
@gdamore - Thank you for the response. Since we seem to be talking from different viewpoints, I would like to clarify. First, when it comes to the design of the software (points 1, 2) -- that is clearly your call, and not for me to say. We are grateful for all the work you have been putting in to keep this implementation flawless and robust to a high standard. With regard to (3, 4), yes, it implies a default timeout; in addition to the problem you mentioned, it can be tricky to come up with the right one. In all of this I want to clarify that by 'timeout' in this discussion I only mean the read/write timeout on an established TCP connection.

This brings me to why I think that timeouts do have a place. When doing a read or write on an established TCP connection, the assumption without a timeout is that the network is reliable. The number of middleboxes and boxes on the path of a connection keeps growing, due to the use of network virtualization (VPC) and software-defined networking. So in regard to (5), the network is not something the software can control. It took us quite a while to fine-tune the timeouts that the REQ/REP socket uses (and it has quite a few), so I was already used to thinking in timeouts to deal with networking situations. By implication, I thought you would consider timeouts part of the whole picture. Network deployments differ, so the ability to adapt parameters to a particular network is important (compare, e.g., a backbone with mobile clients).
@grrtrr If we want to introduce timeouts as well as this PR, as I indicated, I'm fine with that -- I don't believe they should be on by default, but introducing a tunable to give applications the ability to prevent those days-long hangs is a good idea. (Arguably, if this is occurring, then something is horribly wrong with the TCP keep-alive implementation in the peer systems. This should have nothing to do with middleboxes unless they are screwing around with the TCP connection itself, which would be incredibly bad form; I can see some kind of TCP proxy mishandling this, though.) Nonetheless, I will consider timeouts (both during handshake and on established connections) as a separate issue, to be handled via a separate PR.
This fixes an important security and reliability concern -- some of the early negotiation logic was on a synchronous path for dialing, and this could allow a bad peer to either slow or completely prevent any new connections from being established.
The Dialer side is still synchronous, by necessity, as it only opens a single connection at a time.
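A minimal sketch of the listening-side pattern this PR adopts: every accepted connection negotiates in its own goroutine, and finished pipes are handed back over a channel, so one stalled peer can no longer block well-behaved ones. The handshaker interface, acceptLoop, and newPipe are illustrative names standing in for mangos' actual connHandshaker machinery:

```go
package main

import "net"

// handshaker abstracts whatever per-connection negotiation the
// transport performs (illustrative stand-in for the real pipe type).
type handshaker interface {
	handshake() error
}

// acceptLoop accepts connections and handshakes each one concurrently.
func acceptLoop(l net.Listener, newPipe func(net.Conn) handshaker, ready chan<- handshaker) {
	for {
		conn, err := l.Accept()
		if err != nil {
			return // listener closed
		}
		p := newPipe(conn)
		go func() {
			// A slow or hostile peer stalls only this goroutine,
			// never the accept loop itself.
			if p.handshake() == nil {
				ready <- p
			}
		}()
	}
}
```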